arXiv:0911.1112v2 [cs.IR] 6 Nov 2009


Memento: Time Travel for the Web


Herbert Van de Sompel 
Los Alamos National 
Laboratory, NM, USA 


herbertv@lanl.gov 


Lyudmila L. Balakireva 
Los Alamos National 
Laboratory, NM, USA 


ludab@lanl.gov 


ABSTRACT 


The Web is ephemeral. Many resources have representa- 
tions that change over time, and many of those represen- 
tations are lost forever. A lucky few manage to reappear 
as archived resources that carry their own URIs. For ex- 
ample, some content management systems maintain version 
pages that reflect a frozen prior state of their changing re- 
sources. Archives recurrently crawl the web to obtain the 
actual representation of resources, and subsequently make 
those available via special-purpose archived resources. In 
both cases, the archival copies have URIs that are protocol- 
wise disconnected from the URI of the resource of which 
they represent a prior state. Indeed, the lack of temporal 
capabilities in the most common Web protocol, HTTP, pre- 
vents getting to an archived resource on the basis of the 
URI of its original. This turns accessing archived resources 
into a significant discovery challenge for both human and 
software agents, which typically involves following a mul- 
titude of links from the original to the archival resource, 
or of searching archives for the original URI. This paper 
proposes the protocol-based Memento solution to address 
this problem, and describes a proof-of-concept experiment 
that includes major servers of archival content, including 
Wikipedia and the Internet Archive. The Memento solution 
is based on existing HTTP capabilities applied in a novel 
way to add the temporal dimension. The result is a frame- 
work in which archived resources can seamlessly be reached 
via the URI of their original: protocol-based time travel for 
the Web. 
1. INTRODUCTION


“The web does not work,” my eleven year old son com- 
plained. After checking power and network connection, I 
realized he meant something rather more subtle. The URI 
(http://stupidfunhouse.com) he had bookmarked the year 
before returned a page that didn’t look like the original at 
all, and definitely was not fun. He had just discovered that 
the web has a terrible memory. 

Let us restate the obvious: the Web is the most pervasive 
information environment in the history of humanity; hun-
dreds of millions of people¹ access billions of resources² using
a variety of wired or wireless devices. The rapid growth of 
the Web was made possible by a suite of relatively simple, 
yet powerful technologies including TCP/IP, URI, HTTP, 
and HTML. The Web is also highly dynamic, with a sig- 
nificant percentage of resources changing at different rates 
over time [3, 9, 18, 25]. Given the ubiquity of the Web, it 
is rather surprising to find how poor its memory is regard- 
ing these continuous changes. Indeed, once a resource has 
changed, accessing one of its prior versions becomes a sig- 
nificant discovery challenge, no longer merely a matter of 
using Web protocols to dereference its URI. 

In essence, the time dimension is absent from the most 
common of Web protocols, HTTP. This timelessness is even 
written into the W3C's Architecture of the World Wide Web 
[15], which reminds us that dereferencing a URI yields a rep- 
resentation of the (current) state of the resource identified 
by that URI, and highlights the impracticality of keeping 
prior states accessible at their own distinct URIs: 


Resource state may evolve over time. Requiring a 
URI owner to publish a new URI for each change 
in resource state would lead to a significant num- 
ber of broken references. For robustness, Web 
architecture promotes independence between an 
identifier and the state of the identified resource. 


Nevertheless, the Web does contain a meaningful number
of records of the past. Sites based on Content Manage-
ment Systems (CMS) such as MediaWiki, the platform used
by Wikipedia, keep the current version of a page accessible 
at a generic URI, while older versions remain accessible at 


1 http://www.internetworldstats.com/stats.htm


2 http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html


version-specific URIs. Special-purpose services that are con- 
cerned with persistent referencing, such as WebCite, store 
a representation of the resource retrieved at the time the 
service is invoked. Also, inspired by the pioneering work of 
the Internet Archive, there is an ever-growing international 
Web Archiving [6, 22] activity that consists of recurrently 
sending out crawlers to take snapshots of Web resources, 
storing those in special-purpose distributed archives, and 
making them accessible through tools such as the Wayback 
Machine³. Transactional archives [11] store every materi-
ally different representation of a web server’s resources as
they are being delivered to clients. Currently, their use is 
primarily restricted to applications that need to meet spe- 
cial legal requirements, such as keeping an exact record of 
what has been delivered to users of an ecommerce or gov- 
ernment site, and they are therefore typically not openly 
accessible. Exploratory work is ongoing regarding the es- 
tablishment of a peer-to-peer web archive that receives its 
content from browser caches, and that therefore can be con- 
sidered a client-side transactional archive [4]. Also personal 
client-side transactional archives have been proposed [7, 30, 
32] but their private purpose excludes accessing them on 
the Web. Search engine caches may also contain prior rep- 
resentations of resources, but they are restricted to the most 
recent snapshot taken by a crawler. 

Although this variety of archival solutions exists and their 
coverage is growing, accessing last year’s version of a re- 
source remains a significant challenge. In the case of Wiki- 
pedia, one has to resort to its History tab and navigate the 
sometimes thousands of entries there. The situation is sim- 
ilar for most other version-aware sites. For news sites, one 
may find the answer by searching the site’s special purpose 
archive if one exists. And, as an option of last resort un- 
known to many Web users, one can individually search the 
many Web Archives, hoping to find a page that was archived 
at a time close to the desired one. This situation is cum- 
bersome for users who, for example, want to revisit a book- 
marked resource as it existed at the time of bookmarking. 
Research has indicated that anywhere between 50% and 80% 
of page visits are revisits [2, 26, 31]. To an extent, this find- 
ing emphasizes the need for end-user time travel on the Web. 

The poor integration of archival content in regular Web 
navigation is also a fundamental hindrance to applications 
that require finding, analyzing, extracting, comparing, and 
otherwise leveraging historical Web information. Examples 
include Zoetrope, a tool that allows interaction with and vi- 
sualization of high-resolution temporal Web data [1]; DiffIE, 
a Web browser plug-in that emphasizes Web content that 
changed since a previous visit [32]; and time-oriented search 
that tracks the frequency of words and phrases in resources 
over time [20]. These applications must build their own 
special-purpose archives in an ad-hoc manner in order to
achieve their goals. 

In this paper, we present the Memento solution to allow 
temporal access to the Web. Our solution is based on and is 
as simple as the technologies that led to the rapid growth of 
the Web. It focuses on seamless access to archival content 
(irrespective of its location) as part of regular Web naviga- 
tion for both human and software agents. It does not deal 
with the aspect of creating, populating, and maintaining 
archives, but rather leverages their existence. The remain- 


3 http://www.archive.org/


der of the paper is structured as follows: Section 2 briefly re- 
views transparent content negotiation for HTTP in order to 
allow a better understanding of Section 3 which introduces 
the Memento solution for time travel on the Web; Section 
4 describes an experiment that provides a proof of concept 
for the solution; Section 5 discusses open issues; Section
6 provides an overview of related work; and Section 7 holds our
conclusion.


2. CONTENT NEGOTIATION 


Transparent Content Negotiation for HTTP [14] (from 
here on abbreviated as conneg) allows a client to select which 
representation it wants to retrieve from a transparently ne- 
gotiable resource; that is, a resource that has multiple rep- 
resentations (variants) associated with it, each of which is 
available from a variant resource. Currently deployed di- 
mensions that are open to conneg are media type, language, 
compression, and character set. A client expresses prefer- 
ences, possibly according to multiple dimensions, in special- 
purpose HTTP Accept headers. Preferences are qualified 
with “quality”, or “q”, values that are normalized to the
range 0.0 to 1.0 (an argument without a q value is assumed to
have q=1.0). For example, by using the header “Accept- 
Language: en, fr;q=0.7” the client indicates that English is 
preferred and French is acceptable. Based on information in 
these headers, a server will either: 


• Select an appropriate representation: There are two
ways for a server to do so. One way is to provide
an “HTTP 200 OK” response with a “TCN: Choice”
header, and a Content-Location header that indicates
the URI of the variant resource that delivered the rep-
resentation. The other is to provide an “HTTP 302
Found” response with a “TCN: Choice” header, and a
Location header that indicates the URI of where the
client can access the variant resource.


• Respond with an “HTTP 406 Not Acceptable” response
if the server cannot meet the client’s preferences as
stated in the request. The server then also returns
a “TCN: List” header and a list of variant resources
it possesses that are associated with the requested re-
source. The client can then make an informed decision
about variant selection⁴.


RFC 2295 proposes a format for these lists, expressed as 
an Alternates response header that can be used in both the 
Choice and List scenarios. Web servers do not necessarily 
support all the negotiation dimensions for all of their re- 
sources, but do indicate the supported dimensions to clients 
(e.g., “Vary: negotiate, accept-language” if the language di-
mension is supported). Also, note that according to RFC
2295, variant resources do not themselves support content
negotiation⁵.

As an example, presume a transparently negotiable re- 
source http://an.example.org/paper for which the following 
variant resources are available: the paper in HTML and En- 
glish (paper.html.en), in PDF and English (paper.pdf.en), 


4 The client can also force an “HTTP 300 Multiple Choices” response
by issuing a “Negotiate: 1.0” request header. This rarely occurs in
practice, but the response is functionally equivalent to an “HTTP 406
Not Acceptable” response.


5 Servers must return an “HTTP 506 Variant Also Negotiates” response
if variant resources support conneg.


and in PDF and French (paper.pdf.fr). Now presume a client 
wants to access the paper and has a preference for HTML 
and English. The interaction, in which the server makes a 
choice that fully honors the client’s preferences, would then 
be (only headers relevant for conneg are shown): 

GET /paper HTTP/1.1
Host: an.example.org
Accept: text/html, application/pdf;q=0.8
Accept-Language: en-US, fr;q=0.7, de;q=0.5

HTTP/1.1 200 OK
TCN: choice
Vary: negotiate, accept, accept-language
Content-Location: /paper.html.en
Content-Type: text/html
Content-Language: en
Alternates:
  {"paper.html.en" 1.0 {type text/html} {language en}},
  {"paper.pdf.en" 0.8 {type application/pdf} {language en}},
  {"paper.pdf.fr" 0.6 {type application/pdf} {language fr}}



However, if the client prefers PDF over HTML and in- 
sists only on German language documents (French and En- 
glish have q=0.0), the interaction in which the server cannot 


honor the request, and leaves the choice to the client would 
be: 


GET /paper HTTP/1.1
Host: an.example.org
Accept: application/pdf, text/html;q=0.8
Accept-Language: de, fr;q=0.0, en-US;q=0.0

HTTP/1.1 406 Not Acceptable
TCN: list
Vary: negotiate, accept, accept-language
Alternates:
  {"paper.pdf.fr" 0.8 {type application/pdf} {language fr}},
  {"paper.html.en" 0.5 {type text/html} {language en}},
  {"paper.pdf.en" 0.4 {type application/pdf} {language en}}
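
The two exchanges above can be mimicked with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the score simply multiplies a variant's source quality by the client's q values for media type and language, and language ranges are matched on their primary subtag so that en-US accepts a variant tagged en. This is a simplification and not the complete selection procedure of any particular server.

```python
from typing import Dict, List, Optional, Tuple

# (variant URI, source quality, media type, language); the values used in the
# usage example are copied from the first Alternates header shown above.
Variant = Tuple[str, float, str, str]


def parse_accept(header: str) -> Dict[str, float]:
    """Parse an Accept-style header into {value: q}; a missing q means 1.0."""
    prefs: Dict[str, float] = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        prefs[parts[0].lower()] = q
    return prefs


def language_q(prefs: Dict[str, float], language: str) -> float:
    """Assumption: match on the primary subtag only (en-US matches en)."""
    primary = language.lower().split("-")[0]
    matches = [q for rng, q in prefs.items()
               if rng == "*" or rng.split("-")[0] == primary]
    return max(matches) if matches else 0.0


def choose_variant(variants: List[Variant], accept: str,
                   accept_language: str) -> Optional[str]:
    """Return the best variant URI, or None when nothing is acceptable
    (the case in which a server would answer 406 with a variant list)."""
    types = parse_accept(accept)
    langs = parse_accept(accept_language)
    best_uri, best_score = None, 0.0
    for uri, source_q, media_type, language in variants:
        type_q = types.get(media_type, types.get("*/*", 0.0))
        score = source_q * type_q * language_q(langs, language)
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri


VARIANTS = [
    ("paper.html.en", 1.0, "text/html", "en"),
    ("paper.pdf.en", 0.8, "application/pdf", "en"),
    ("paper.pdf.fr", 0.6, "application/pdf", "fr"),
]
```

With the first request's headers the sketch picks paper.html.en; with the second request's headers every variant scores zero, so it returns None, which corresponds to the 406 response.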


3. THE MEMENTO SOLUTION 


In this section, we introduce the two core building blocks 
of the Memento solution to allow temporal navigation of the 
Web: HTTP content negotiation in the datetime dimension, 
and an API for archives of web resources that allows request- 
ing an inventory of available archived resources associated 
with a resource with a given URI. 


3.1 A Memento: An Archival Resource 


We introduce the term Memento to refer to an archival 
record of a resource. More formally, a Memento for a re-
source URI-R (as it existed) at time t_i is a resource
URI-M_i[URI-R@t_i] for which the representation at any moment
past its creation time t_c is the same as the representation
that was available from URI-R at time t_i, with t_c ≥ t_i. Im-
plicit in this definition is the notion that, once created, a
Memento always keeps the same representation. 

In the remainder of this paper, the term original resource 
is used to refer to a resource that itself is not a Memento 
of another resource, and URI-R is used to denote its URI. 
URI-M is used to denote the URI of a Memento. 


3.2 HTTP Datetime Content Negotiation


We introduce the notion of content negotiation in the 
datetime dimension (from here on abbreviated as DT-conneg),
allowing a client to indicate that it is looking for past rather 
than current representations of a resource. This is achieved 
by using a special-purpose Accept header, experimentally 
named X-Accept-Datetime, which has datetimes (rather 
than media type or similar) as its value: 


X-Accept-Datetime: (Sun, 06 Nov 1994 08:49:37 GMT) 


Generally speaking, DT-conneg works in very much the
same way as existing conneg approaches: If a client wants
to retrieve a Memento of the original resource URI-R, it is-
sues an HTTP GET at URI-R using the X-Accept-Datetime
header to express the datetimes of the archival record(s) of
URI-R in which it is interested. The server handling this
HTTP GET request tries to honor it by delivering a rep-
resentation it chooses based on the client's datetime prefer-
ence(s), and/or by providing the client with a list of available
variant resources, each of which is a Memento of URI-R. As de-
scribed in more detail below, DT-conneg differs from other
conneg approaches in two respects:


• Cases exist in which the server hosting URI-R cannot
itself honor the DT-conneg request, but instead redi-
rects to a server that can.


• The list of available variant resources can be too exten-
sive to be expressed in an Alternates header. In this
case, a combination of a sizeable Alternates header
listing variants centered on the requested datetime(s),
and an HTTP Link header pointing at an extensive
list of variants is used.
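
As a minimal sketch of the client side, the following Python builds and parses an X-Accept-Datetime value in the HTTP-date form used in the example header above. The helper names are hypothetical, and the surrounding delimiters simply mirror that example as printed here; they are an assumption, not a statement about the header's normative syntax.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict


def accept_datetime_headers(when: datetime) -> Dict[str, str]:
    """Build request headers asking for the state of a resource at `when`.

    `when` should be timezone-aware; it is converted to UTC and rendered
    as an HTTP-date ending in "GMT".
    """
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    # Assumption: wrap the value as in the example header in the text.
    return {"X-Accept-Datetime": f"({stamp})"}


def parse_accept_datetime(value: str) -> datetime:
    """Recover the datetime from a header value, ignoring any wrapper."""
    return parsedate_to_datetime(value.strip().strip("(){}"))
```

A client would send these headers with an ordinary HTTP GET on URI-R; a server can apply the parser to the same header before selecting a Memento.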


Before deciding on the X-Accept-Datetime header, we in- 
vestigated possible alternatives that could be used in HTTP 
interaction. We decided not to use the "features" exten- 
sibility mechanism introduced by RFC 2295 because it is 
geared at the fine-grained specification of variant options 
(e.g., paper size, color depth) and hence is not suitable for 
something with the primacy of datetime. Also, the ongoing 
Media Fragment work of the W3C [33] is not applicable be- 
cause it proposes expressing a segment of a resource (e.g., 
a region of an image, a section of a video) as a URI frag- 
ment. It does not deal with the notion of a resource that 
has changing representations over time. 


3.3 A TimeGate: A Resource Capable of DT- 
conneg 


We introduce the term TimeGate to refer to a transpar-
ently negotiable resource that supports the datetime dimen-
sion. More formally, a TimeGate for an original resource
URI-R is a transparently negotiable resource URI-G[URI-R]
for which all variant resources are Mementos
URI-M_i[URI-R@t_i] of the resource URI-R. Since multiple archives may
host versions of URI-R, multiple TimeGates may exist for 
any given resource, i.e. one per archive. 


3.4 Time Travel: Combining DT-conneg and 
TimeGates 


To further explain DT-conneg and TimeGates, two sepa- 
rate scenarios are explored. The combination of these sce- 
narios provides a solution for temporal Web navigation that 
integrates operational web servers and archives of all types. 
To allow for a better understanding, the description is re- 
stricted to conneg in the datetime dimension only. Also, in 
order to keep examples simple, requests with multiple date- 
time values and associated q-values are not used. It should 
be noted, however, that both multi-dimensional conneg, and 
multiple datetime values are possible in the proposed frame- 
work, since it builds on the principles of RFC 2295 that 
provides these capabilities. Furthermore, we assume that 


the server that hosts the original resource URI-R for which 
a client wants to retrieve Mementos, is able to detect the 
existence of an X-Accept-Datetime header. 

Before describing the scenarios, let us provide some ex- 
planatory information about the HTTP headers that are 
involved: 

Alternates: RFC 2295 requires listing all variant resources. 
However, since an extensive set of variant resources may ex- 
ist in case of DT-conneg, a complete Alternates listing is imprac-
tical. Therefore, Alternates only lists a limited number of
variant resources, centered on the datetime requested by the 
client. 

Link: To compensate for the incomplete list of variant 
resources in Alternates, an HTTP Link header [23] provides 
a pointer to a resource (the TimeBundle, see Section 3.5) 
that supports retrieving a list of all variant resources (Me- 
mentos), and their associated metadata. 

X-Archive-Interval: Indicates the entire datetime interval 
for which the archival server has Mementos for URI-R. 

X-Datetime-Validity: Indicates the datetime interval dur-
ing which the provided representation was valid. Certain
servers, including CMS and transactional archives, can re-
liably provide this information. Others, such as crawler-
driven web archives, cannot.
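
To make the roles of these headers concrete, here is a hedged Python sketch of what a TimeGate might compute for one request: the Memento to serve, the limited set of neighbours to list in Alternates, and the overall interval to report in X-Archive-Interval. The function name, the "nearest in time" policy, and the window size are all assumptions made for illustration; the text above does not prescribe a selection policy.

```python
from bisect import bisect_left
from datetime import datetime, timezone
from typing import List, Tuple

Memento = Tuple[datetime, str]  # (archival datetime, URI-M)


def select_memento(mementos: List[Memento], requested: datetime,
                   window: int = 2):
    """Return (chosen, neighbours, interval) for a list sorted by datetime.

    chosen     -- the Memento closest in time to `requested` (assumed policy)
    neighbours -- up to `window` Mementos on each side, for Alternates
    interval   -- (first, last) archival datetime, for X-Archive-Interval
    """
    if not mementos:
        raise ValueError("no Mementos available for this resource")
    times = [t for t, _ in mementos]
    pos = bisect_left(times, requested)
    # Only the entries just before and just after `requested` can be closest.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(mementos)]
    best = min(candidates, key=lambda i: abs(times[i] - requested))
    lo = max(0, best - window)
    neighbours = mementos[lo:best + window + 1]
    return mementos[best], neighbours, (times[0], times[-1])
```

The full list of Mementos, which this sketch deliberately truncates, is what the Link header's target is meant to expose.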


3.4.1 Web servers with archival capabilities 


Some web servers handle aspects of resource archiving na- 
tively, by maintaining explicit information about the loca- 
tion and datetimes of archival records of their resources, 
stored internally or remotely. Many CMS, Version Control 
Systems, as well as the TTApache system [8] fall under this 
category. But also servers that recurrently archive into a 
cloud store and keep track of the URIs of the remote archival 
records fit in. 

When a client is looking for Mementos of an original re- 
source URI-R hosted by these servers, they can handle the 
requests internally since all the information that is required 
— URIs of Mementos and their datetimes — is available. In 
this case, the set-up is as follows: 


• URI-R itself becomes a transparently negotiable re-
source that supports DT-conneg to provide access to
all its available Mementos. In essence, URI-R func-
tions as its own TimeGate URI-G. Note that typical
URI-Rs for these systems either provide access to the
current version of a resource, or to a list of all its ver-
sions (each with its own URI-M), or to a combination
of both.


• All Mementos URI-M_i[URI-R@t_i] of URI-R become
variant resources for URI-R.


Figure 1 depicts a typical, successful DT-conneg transac-
tion flow for this type of server, including the HTTP head-
ers that are used. The transactional behavior for less trivial
cases is also considered in the Memento solution, but space
prevents us from discussing it here. Such cases include
requesting Mementos for datetimes that are outside the date
range for which the server has archival records, requesting
Mementos for URI-Rs that no longer exist, and the client
providing a datetime that the server is unable to parse⁶.


6 Details: http://mementoweb.org/guide/http/local


3.4.2 Web servers without archival capabilities 


Many other servers have no local archival capabilities 
whatsoever. They host resources for which only a represen- 
tation of the current state can be retrieved, and are unaware 
of the details regarding the existence of Mementos of their 
resources in other archival servers. Naturally, such a server 
cannot redirect a client that requests an archival record of 
one of its URI-Rs to an appropriate Memento. However, 
these systems can still play a constructive role by redirecting 
the client to a server that is equipped to handle the request: 
an archive of web resources. In this case, the set-up is as 
follows: 


• Upon detection of the X-Accept-Datetime header in
the client’s request for URI-R, the server merely redi-
rects (using “HTTP 302 Found”) the client to an
archival server. Note that this is not a 302 redirection
that is part of a conneg transaction, as described in
Section 2. Rather it is a 302 redirection that results
from detecting the X-Accept-Datetime header.


• The redirection is to a TimeGate URI-G[URI-R] that
the archival server makes available for the original
resource URI-R.


• The archive’s URI-G is a transparently negotiable re-
source that supports DT-conneg to provide access to
all the Mementos that the archive has available for
URI-R.


• All Mementos URI-M_i[URI-R@t_i] that the archive has
available for URI-R become variant resources for its
URI-G.


Figure 2 depicts a typical, successful DT-conneg trans-
action flow for this type of server, and includes the HTTP
headers that are involved. Again, the transactional behav-
ior for less trivial cases is not covered here⁷. In essence, the
solution is the same as in the above case, with the exception 
that the TimeGates reside on an external archival server, 
not on the server that hosts the original resource URI-R. 
This distinction raises two important questions. 

First, to which archive should a server redirect? In order 
to help the client, a server should redirect to an archive that 
has the best archival coverage of its resources. Servers that 
have an associated transactional archive should redirect to
it; servers that have explicit recurrent crawling agreements
with systems such as Archive-It⁸ should point there; other
servers may point at their country-specific archive (such
as the Finnish, Danish, or Canadian archives); and in
many cases servers can point at the Internet Archive. Note
that scenarios may be envisioned in which the redirection is 
subject to configuration, for example, redirection to differ- 
ent archives depending on archival time-range, media type, 
etc. Then again, this problem of redirecting to a specific 
archive could be addressed by uniformly pointing at an ag- 
gregator service that holds crucial metadata (e.g., URI-R, 
URI-G, URI-M, t_i) about Mementos available in a variety of
archival servers, and that exposes cross-archive TimeGates
URI-G[URI-R]. In Section 3.5, we introduce a discovery API
for archives that enables the creation of such a TimeGate 
aggregator. 


7 Details: http://mementoweb.org/guide/http/remote 
8 http://www.archive-it.org/
Second, how does the server know the URI-G of the
TimeGate for its own URI-R on an external archival server? 
This problem can be addressed by introducing archive- 
specific or cross-archive conventions for the syntax of the
URI-G of TimeGates as a function of URI-R. This would simply
formalize the status quo, as all major web archives that use
the Heritrix/Wayback solution already use such conventions.
For example, the URI to retrieve a list of all archived ver- 
sions of http://cnn.com/ is: 


http://web.archive.org/web/*/http://cnn.com/ 


Hence, the URI that could be used as a convention for the 
Internet Archive’s TimeGate for http://cnn.com/ would be: 


http://web.archive.org/web/timegate/http://cnn.com/ 


Such a convention seems achievable in the context of the 
International Internet Preservation Consortium⁹ that has
made archive interoperability one of its goals. However, 
when a TimeGate aggregator service is introduced, URI- 
G syntax conventions for individual archives are not crucial; 
only a convention for the aggregator's URI-G syntax would 
be essential. 
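
The redirecting role of a server without archival capabilities, combined with a URI-G syntax convention, can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the prefix is the convention suggested in the text above, not a confirmed endpoint of any archive.

```python
from typing import Dict, Tuple

# Hypothetical convention taken from the example in the text.
TIMEGATE_PREFIX = "http://web.archive.org/web/timegate/"


def timegate_uri(uri_r: str, prefix: str = TIMEGATE_PREFIX) -> str:
    """Derive URI-G from URI-R by prepending the archive's TimeGate prefix."""
    return prefix + uri_r


def respond(uri_r: str, request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request on URI-R.

    If the client sent X-Accept-Datetime, redirect it to the TimeGate;
    otherwise serve the current representation as usual (body omitted).
    """
    names = {name.lower() for name in request_headers}
    if "x-accept-datetime" in names:
        return 302, {"Location": timegate_uri(uri_r)}
    return 200, {}
```

If an aggregator were used instead, only the prefix would change; the server-side logic would stay the same.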


3.5 Discovering Mementos: TimeBundles and 
TimeMaps 

For discovery purposes, we introduce the notion of a re- 
source hosted by an archival server, via which a full overview 
is available of all Mementos that the archive holds for an 
original resource URI-R; we name such a resource a Time-
Bundle. More formally, a TimeBundle for a resource URI-R
is a resource URI-B[URI-R] that is an aggregation of: (a) all
Mementos URI-M_i[URI-R@t_i] available from an archive, (b)
the archive's TimeGate URI-G for URI-R, and (c) the original
resource URI-R itself.

Given the semantics of a TimeBundle, as an aggregation of 
a set of resources, all of which share a temporal relationship 
with URI-R, we propose to model it as an ORE Aggrega- 
tion [34]. The ORE specifications comply with the Linked 
Data conventions [5], and treat an ORE Aggregation as a 
non-information resource [21] described by an information 
resource that is accessible via an HTTP 303 redirect from 
the URI of the ORE Aggregation. We name the information 
resource that describes the TimeBundle a TimeMap; it is a 
specialization of an ORE Resource Map. The TimeMap lists 
the URIs of all resources that are aggregated in the Time- 
Bundle, as well as metadata that is available about them. 
We have not formally engaged in specifying which meta- 
data to convey in TimeMaps, but essentials such as archival 
datetime, media type, and language, as well as more specific 
information such as digest, number of observations, validity 
time-range [4] must be considered¹⁰.

TimeBundles made available by archives may be leveraged
in real-time client interaction, since their URI-B is expressed 
as the content of the HTTP Link header (see the HTTP 
headers in Figures 1 and 2). And, when an archive makes its 
TimeBundles discoverable using common approaches such 
as SiteMaps [12], Atom Feeds [24], or OAI-PMH [19] they 
become a powerful mechanism for batch harvesting of meta- 


9 http://www.netpreserve.org/


10 An example RDF/XML TimeMap as used in our experiment is avail-
able at http://mementoweb.org/guide/api/map1


data that describes an archive's entire collection, and that 
can be used for the creation of cross-archive services. 


3.6 A TimeGate Aggregator 


If various archives implement TimeBundles and associated 
TimeMaps, and make them discoverable using the aforemen- 
tioned techniques, then information about Mementos hosted 
by different archives can be harvested into an aggregator 
service. For each original resource URI-R, for which Me- 
mentos exist in the harvested archives, such an aggregator 
then minimally holds the distinct URI-Ms of each of those 
Mementos in the various archives, as well as their archival 
datetime, media type, language etc. This information allows 
the aggregator to introduce TimeGates URI-G for each of 
the URI-Rs for which the harvested archives have Memen- 
tos. The variant resources for any specific TimeGate are 
the Mementos for URI-R as they exist in the distributed 
archives. Because the aggregator has information on Me- 
mentos across archives, its time-granularity is finer than that 
of any of the individual archives. This provides the aggre- 
gator with a better range of possibilities when redirecting a 
client to a Memento in response to a request for a specific 
datetime. In essence, this aggregator behaves as the archival 
servers discussed in Section 3.4.2, but it has a broader cov- 
erage both regarding URI-Rs and Memento datetimes, and 
it does not store the Mementos itself. 

Figure 3 illustrates the value such an aggregator can bring 
to time travel. It shows various Mementos for the noaa.gov 
home page as it was around the time of Hurricane Katrina. 
To reconstruct how the drama unfolded, inspecting Me-
mentos held by different archives is required. Indeed, both
the content of the Mementos and the archival server that
holds them change as time progresses. Note also that, although the In-
ternet Archive claims to have coverage for September 9, 2005,
the Memento is not really available (bottom left of Figure 
3; it is not known if this is a permanent or transient error); 
the next available Memento is for September 10 2005, and is 
available from Archive-It. In cases like this, an aggregator 
could support navigation across archives and across time. 
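
A rough Python sketch of such an aggregator follows. It assumes that each archive's Memento list has already been harvested and sorted by datetime; the function names, the archive labels, and the availability check are hypothetical, and "nearest in time" is again an assumed policy.

```python
from datetime import datetime, timezone
from heapq import merge
from typing import Callable, Dict, List, Optional, Tuple

Memento = Tuple[datetime, str]          # (archival datetime, URI-M)
Entry = Tuple[datetime, str, str]       # (archival datetime, archive, URI-M)


def aggregate(per_archive: Dict[str, List[Memento]]) -> List[Entry]:
    """Merge per-archive lists (each sorted by datetime) into one timeline."""
    tagged = ([(t, name, uri) for t, uri in items]
              for name, items in per_archive.items())
    return list(merge(*tagged))


def nearest(timeline: List[Entry], requested: datetime,
            available: Callable[[str], bool] = lambda uri: True) -> Optional[Entry]:
    """Pick the entry closest to `requested`, skipping unavailable Mementos."""
    usable = [entry for entry in timeline if available(entry[2])]
    if not usable:
        return None
    return min(usable, key=lambda entry: abs(entry[0] - requested))
```

Passing an `available` predicate that rejects a broken Memento makes the sketch fall through to the next closest one, possibly in a different archive, which is the cross-archive navigation described above.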


4. EXPERIMENT 


We have performed an experiment to demonstrate the fea- 
sibility of the proposed DT-conneg framework involving a 
diverse array of components that jointly realize web time 
travel across various servers. The deployed environment is 
depicted in Figure 4. The arrows indicate the flow of HTTP 
interactions shown in Figures 1 and 2, subject to the follow- 
ing considerations that are directly related to conducting a 
time travel experiment in a Web that is not (yet) DT-conneg 
enabled. 

First, as it was not realistic to try and get active develop- 
ment involvement from existing archival servers within the 
timeframe the reported work took place, TimeGates and 
TimeBundles for several archives (CMS and web archives) 
were not implemented natively within those systems but 
rather by-proxy. This means that they were exposed by 
servers under our control, which obtained the essential in- 
formation from the archives using ad-hoc techniques such 
as screen scraping. While it may seem that this approach 
undermines the essence of the protocol-based DT-conneg
framework, it actually is a strong illustration of its feasibil- 
ity: if one can scrape the essential information from archives' 
pages, it is certainly available in their databases, and hence, 
Figure 3: Distributed archive coverage of www.noaa.gov. 3(a) and 3(d) come from Archive-It and 3(b) and
3(c) come from the Internet Archive. Note that 3(c) has either a transient or permanent error.


native implementation should be more straightforward than 
by-proxy. Also, while we rely on a by-proxy approach for 
certain systems, demonstrations of the feasibility of native 
implementation are also available. 

Second, existing web servers do not currently detect the X- 
Accept-Datetime header required for time travel, and hence 
cannot issue the essential “HTTP 302 Found” redirect to a TimeGate
(see Section 3.4.2). These servers will currently respond as
usual, typically with an “HTTP 200 OK” or “HTTP 404
Not Found”. In order to still be able to demonstrate the
DT-conneg framework in the experiment, the remedy is to
have the time travel client detect such responses that are un- 
expected from the time travel perspective, and take control 
by subsequently issuing the DT-conneg request directly to a 
TimeGate for URI-R exposed by an archival server (native 
or by-proxy). In essence, in these cases, the client fulfills the 
redirecting role that the host of URI-R normally would in 
the DT-conneg framework. For servers outside of our con- 
trol, there was no other option than to resort to this client 
approach; for servers under our control the redirect to a 
TimeGate was implemented natively. 
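
The client-side remedy can be sketched as follows in Python. The transport is injected as a `fetch` callable so that the logic can be read and exercised without network access. Everything named here is hypothetical, including the fallback prefix, and the test for an "unexpected" response (no TCN header and no 302 redirect) is an assumed heuristic rather than the rule used by the experiment's client.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str]]            # (status code, headers)
Fetch = Callable[[str, Dict[str, str]], Response]

# Hypothetical TimeGate prefix the client falls back to.
FALLBACK_PREFIX = "http://archive.example/timegate/"


def time_travel(uri_r: str, accept_datetime: str, fetch: Fetch,
                fallback_prefix: str = FALLBACK_PREFIX) -> Response:
    """Issue a DT-conneg request for URI-R, taking over if the host ignores it."""
    request = {"X-Accept-Datetime": accept_datetime}
    status, headers = fetch(uri_r, request)
    lowered = {name.lower(): value for name, value in headers.items()}
    if "tcn" in lowered:
        # The host negotiated in the datetime dimension itself.
        return status, headers
    if status == 302 and "location" in lowered:
        # The host redirected to a TimeGate; repeat the request there.
        return fetch(lowered["location"], request)
    # Unexpected from the time travel perspective (e.g. plain 200 or 404):
    # the client plays the redirecting role and goes to a TimeGate directly.
    return fetch(fallback_prefix + uri_r, request)
```

In practice `fetch` would wrap an HTTP library call with redirects disabled, so that the client sees each 302 and can carry the X-Accept-Datetime header along.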

The following is a description of the components involved 
in the experiment: 

Web servers: We equipped domains under our own con- 
trol with the capability to honor DT-conneg requests by 
detecting the X-Accept-Datetime header, and redirecting to 
TimeGates exposed by an appropriate archival server. This 
was trivially implemented using an Apache mod_rewrite 
rule¹¹ for servers we could configure: 


http://lanlsource.lanl.gov/ 
http://odusource.cs.odu.edu/ 
http://digitalpreservation.gov/ 


(LANL, ODU, and LoC, respectively in Figure 4). For ob- 
vious reasons, we were not able to implement this for servers 
beyond our control. 
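
The server-side decision that such a rewrite rule encodes is small. The actual rule is not reproduced here; the Python sketch below only illustrates the logic, and the TimeGate prefix and function name are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical TimeGate prefix exposed by an archival server.
TIMEGATE_PREFIX = "http://archive.example.org/timegate/"


def handle_request(uri_r: str, headers: Dict[str, str]) -> Tuple[int, Optional[str]]:
    """Return (status code, Location value) for a request to URI-R."""
    # HTTP header names are case-insensitive, so normalize before checking.
    lowered = {name.lower() for name in headers}
    if "x-accept-datetime" in lowered:
        # DT-conneg request: redirect to a TimeGate for this URI-R.
        return 302, TIMEGATE_PREFIX + uri_r
    # Regular request: serve the current representation as usual.
    return 200, None
```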

Archives: Wikipedia is a prominent example of the class 
of servers with local archival capabilities. TimeGates (and 
TimeBundles) for it were implemented by-proxy (Wiki proxy 
in Figure 4). However, to demonstrate the possibility of 
native implementation, a plug-in was developed that adds 
X-Accept-Datetime and TimeGate capabilities to the MediaWiki 
platform on which Wikipedia is based¹². To cover 
for the class of servers that lack local archival capabilities, 
TimeGates (and TimeBundles) were implemented by-proxy 
for the Internet Archive (IA proxy in Figure 4), the Inter- 
net Archive's Archive-It (AI proxy in Figure 4), the Library 
of Congress' Archive-It, the Government of Canada Web 
Archive, and WebCite. In addition, we developed a transac- 
tional archive platform and deployed it at LANL and ODU 
(LANL TA and ODU TA in Figure 4, respectively). As the 
LANL and ODU web servers respond to client requests, the 
representations they serve are pushed into these archives, 
yielding a high-resolution archival record of their evolving 
resources. It is worth noting that the described selection 
covers a broad range of commonly deployed archival solu- 
tions: CMS, web-crawler based archives, on-user-demand 
archives, and transactional archives. 

¹¹ See http://mementoweb.org/tools/apache 
¹² Plug-in at http://mementoweb.org/tools/wiki 

Aggregator: Furthermore, a TimeGate aggregator (Aggr 
in Figure 4) was developed that collects archival metadata 
from the aforementioned web archives’ TimeBundles (some 
by-proxy and some native), and can hence serve as a com- 
mon target for redirection. This collecting is currently done 
dynamically: as a client requests a Memento for an original 
resource URI-R via the aggregator, the aggregator contacts 
associated TimeBundles in various archives, merges the re- 
turned TimeMap information, and only then redirects the 
client to an appropriate Memento. This experimental ap- 
proach makes retrieving Mementos via the aggregator pre- 
dictably slow. 
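
The merging step performed by the aggregator can be pictured with a short sketch. It assumes, purely for illustration, that each archive's TimeMap has already been parsed into a list of (archival datetime, Memento URI) pairs; the real TimeMap serialization and the aggregator's code are not shown here.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[datetime, str]  # (archival datetime, Memento URI)


def merge_timemaps(timemaps: Iterable[Iterable[Entry]]) -> List[Entry]:
    """Merge per-archive TimeMaps into one chronologically sorted list."""
    seen: Dict[str, datetime] = {}
    for timemap in timemaps:
        for archived_at, uri_m in timemap:
            # Keep the first datetime reported for a given Memento URI.
            seen.setdefault(uri_m, archived_at)
    merged = [(archived_at, uri_m) for uri_m, archived_at in seen.items()]
    # Sort by datetime, then by URI so the order is deterministic.
    return sorted(merged)
```

Because every request triggers this collection across several archives before the redirect is issued, the latency of the slowest archive bounds the response time, which is consistent with the slowness noted above.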

Clients: We developed a Firefox plug-in that allows setting 
the browser to time travel mode, and selecting a datetime 
for the journey. From then onwards, the browser adds 
an X-Accept-Datetime header, with the datetime value set 
by the user, to every HTTP GET issued. If all targeted 
servers implemented the “HTTP 302 Found” redirection 
upon detection of the X-Accept-Datetime header, only 
archival pages would be retrieved, and all links in those 
pages would be interpreted as requests for Mementos. This 
effectively happens for the servers under our control (the 
black flows labeled [1], [2] and [4] in Figure 4). As described 
above, other servers do not exhibit this behavior (the red 
flows [3] and [5] in Figure 4). Implementing the remedial 
behavior where the client itself takes care of the redirection 
turned out not to be trivial in the Mozilla plug-in framework, 
as it does not support intercepting and modifying 
responses¹³ (e.g., on 404 or 200 response codes). The result is 
a time travel plug-in that deals perfectly with URI-Rs of the 
servers under our control but not with any others. We therefore 
developed a time travel client that runs on a server, 
built with the Apache mod_python framework, which 
offered the required flexibility. The resulting gateway 
client handles all flows of Figure 4 correctly, and fully 
demonstrates the potential of the DT-conneg framework. It 
is accessible via a web form that allows entering URI-R and 
a datetime. Upon submitting the time travel request, the 
gateway client (not the browser) fulfills the DT-conneg 
requests and, once they are completely handled, returns the resulting 
Memento page to the browser. In order to allow for contin- 
ued time travel of links in the page, they need to be rewritten 
to point at the gateway client. This is merely an artifact of 
a server-side, not a browser-based, implementation. This 
client also depicts the HTTP transactions that take place 
during time travel, and allows inspecting the HTTP head- 
ers involved. 
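
The link rewriting needed by the gateway client can be sketched as below. The gateway address and its query parameter names are assumptions made for this example; the form actually used by the gateway client is not described here.

```python
from urllib.parse import quote, urljoin

# Hypothetical address of the server-side gateway client.
GATEWAY = "http://gateway.example.org/timetravel"


def rewrite_link(href: str, base_uri: str, datetime_value: str) -> str:
    """Rewrite a link found in a Memento so that it points at the gateway."""
    # Resolve relative links against the URI-R of the page being displayed.
    absolute = urljoin(base_uri, href)
    # Percent-encode both values so they survive as query parameters.
    return (
        f"{GATEWAY}?uri={quote(absolute, safe='')}"
        f"&datetime={quote(datetime_value, safe='')}"
    )
```

With such rewriting, following a link in the returned page sends the browser back to the gateway with the same datetime, so the journey continues in the past.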

With the above components in place, an experimental en- 
vironment results that effectively demonstrates the feasibil- 
ity of web time travel using the Memento solution. Two 
clients, both admittedly with respective restrictions, allow 
navigating the past Web in very much the same way as 
the current Web is browsed; they seamlessly move across 
web servers and archives (CMS-style and web archives) us- 
ing the HTTP protocol, extended with DT-conneg, to try 
and return a Memento that meets the client’s preference. 
Due to the various by-proxy components, and the dynamic 
implementation of the aggregator, the navigation can often 
be slow. However, the navigations that involve the servers 
with full native support (flows [1] and [2] in Figure 4), those 
that bypass the aggregator (flows [1], [2] and [3]), and those 
for which the aggregator can respond from its limited cache 
(some flows [4] and [5]) perform noticeably faster, even 
using the gateway client. In addition, many well understood 
techniques including batch harvesting, caching, and recur- 
rent refreshing are available to improve the performance of 
the aggregator and fundamentally improve response times. 

As an illustration of our results, Figure 5 shows two nav- 
igations conducted on November 2 2009, around 16:25:00 
UTC: one in real-time, and one in time travel mode with a 
datetime set to October 12 2009 16:25:00 UTC. The captions 
in the figure also indicate the flow of the HTTP interactions 
in the experimental environment as indicated in Figure 4. 
The DT-conneg framework allows a re-navigation of both in 
the future. It suffices to use these respective datetimes as 
the DT-conneg value, and hope that archives have records 
of the resources involved¹⁴. 
Figure 4: The Memento Experiment environment. 


5. DISCUSSION 


In this section, we touch upon issues pertaining to the Me- 
mento solution that require further attention. The seamless 
integration of archives in regular web navigation provided by 
DT-conneg allows exploring novel ways to address common 
problems that result from the dynamics of the Web. Some 
of these have tentatively been explored in our experiment. 
Consider the following cases: 

A URI-R vanishes, but the server that used to serve it 
is still operational: In this case, the server should still issue 
the redirect to a TimeGate upon detection of the DT-conneg 
request. This allows seamless access to a Memento of URI- 
R, even if the server no longer hosts the original. 

A domain vanishes: The client is looking for a current rep- 
resentation of a URI-R that was hosted by the domain, but 
fails. The client resorts to interaction with other archives 
or with a TimeBundle aggregator and arrives at the most 
recent Memento of the resource. 


¹⁴ Try it at http://mementoweb.org/demo/client1 


A domain is taken over by a new custodian: The new custodian 
adheres to different policies regarding the archive to which 
DT-conneg requests are redirected. The client understands from 
the X-Archive-Interval returned by that archive of choice, 
that it does not cover the time range in which the previous 
custodian operated the domain. The client resorts to inter- 
action with other archives or with a TimeBundle aggregator 
and arrives at an appropriate Memento. 

Two aspects related to the integration of the proposed 
Memento solution into the existing Web infrastructure re- 
quire attention. First, when issuing a request with an X- 
Accept-Datetime header to a server that hosts the original 
resource URI-R, all caches between the client and the server 
must be bypassed in order to avoid retrieving a current rep- 
resentation of URI-R. In our experiment, we enforced this 
behavior through a combination of two client request headers: 
“Cache-Control: no-cache” to force cache revalidation 
and “If-Modified-Since: Thu, 01 Jan 1970 00:00:00 GMT” 
to make sure that revalidation fails. Further research is re- 
quired to find an alternative for this admittedly inelegant ap- 
proach. Ideally, it should leverage existing caching practice 
but extend it in such a way that caches are only bypassed in 
DT-conneg when essential, but still used whenever possible 
(e.g., to deliver Mementos). Second, when it comes to list- 
ing variant resources in response headers, the DT-conneg 
framework cannot operate according to the letter of RFC 
2295. Indeed, the RFC states: “If a response from a trans- 
parently negotiable resource includes an Alternates header, 
this header MUST contain the complete variant list bound 
to the negotiable resource.” This mandate is based on a 
perspective expressed in the RFC that “it is expected that 
a typical transparently negotiable resource will have 2 to 10 
variants, depending on its purpose.” Clearly, TimeGates as 
proposed in the DT-conneg framework can have many more 
than 10 Mementos. We do not think this makes the con- 
neg framework inapplicable to the datetime dimension, but 
rather we believe DT-conneg introduces a challenge that the 
authors of the RFC did not anticipate ten years ago. As de- 
scribed, we propose a solution based on a sizeable Alternates 
header combined with an HTTP Link header that leads to a 
complete list of variants; other options should be explored. 
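
As an illustration of the cache-bypassing request described in this section, the sketch below assembles the three request headers for a given datetime. The header names and the two fixed values come from the text; the function name and the use of an HTTP-date (RFC 1123 style) value for X-Accept-Datetime are assumptions of this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def time_travel_headers(when: datetime) -> Dict[str, str]:
    """Build request headers for a DT-conneg request that bypasses caches.

    `when` must be a timezone-aware datetime.
    """
    # usegmt=True requires a UTC datetime, so convert first.
    when_utc = when.astimezone(timezone.utc)
    return {
        # Assumed value syntax: an HTTP-date such as used in other HTTP headers.
        "X-Accept-Datetime": format_datetime(when_utc, usegmt=True),
        # Force intermediate caches to revalidate with the origin server ...
        "Cache-Control": "no-cache",
        # ... and use the epoch so that the revalidation cannot succeed.
        "If-Modified-Since": "Thu, 01 Jan 1970 00:00:00 GMT",
    }
```

For the time travel navigation of Figure 5, `time_travel_headers(datetime(2009, 10, 12, 16, 25, tzinfo=timezone.utc))` yields an X-Accept-Datetime value of `Mon, 12 Oct 2009 16:25:00 GMT`.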

An interesting characteristic of the DT-conneg framework 
requires more explicit attention. When requesting a Me- 
mento for a page that contains links to external pages, or 
embedded resources such as images or videos, each of those 
is requested with DT-conneg from the respective servers 
that host/hosted the URI-R of those resources. This is 
a core characteristic that the proposed time travel frame- 
work shares with regular web navigation. It should be noted 
that this is not the current behavior of pages stored in web 
archives. Indeed, in order to avoid filling out an archived 
page with current representations of embedded resources, 
web archives rewrite URIs in archived pages to point back 
into the archive at archived representations of those resources. 
The same happens with links in archived pages, effectively 
turning the archive into an island isolated from the rest 
of the Web. The upside of this approach is that archived 
pages are self-contained: the page and its embedded re- 
sources were typically crawled around the same time and 
hence the archived page is likely to be a faithful reconstruc- 
tion of what the original looked like at the time of the crawl. 
The drawback of the approach is that navigation is restricted 
to the archive's island. Navigating beyond it to obtain an 
archived version of a linked resource that is not available in 
the archive but might be available elsewhere on the Web, is 
not possible. Further exploration is required to arrive at a 
strategy for web archives that would at the same time ad- 
here to the self-containedness principle and allow external 
navigation using the DT-conneg framework when beneficial. 
Another challenge pertains to selecting a Memento that 
best meets the client’s conneg preferences. There are two 
aspects to this problem. The first relates to the archival 
datetime of the Memento that an archive should return in 
response to a datetime expressed by a time travel client. 
For certain archives the choice is straightforward. Indeed, 
transactional archives and servers such as Wikipedia know 
exactly during which time interval a certain Memento func- 
tioned as the active representation of URI-R (cf. the X- 
Datetime-Validity discussed in Section 3.4). Hence, they 
can return the Memento that was active at the datetime 
specified by the client. However, for resources not hosted 
by such servers, it will be rare that any archive has a Me- 
mento that perfectly matches the client’s preference. In this 
case, an archive (or a TimeBundle aggregator) must make 
a choice. A typical approach used by existing web archives 
is to choose the Memento that is the “closest” in time, re- 
gardless of whether its archival datetime is before or after 
the requested datetime. But this approach is challenged 
when pages have embedded resources. The more resources 
required to render a page, the more variation there will be 
between the requested datetime and the archival datetimes 
of available Mementos. As a matter of fact, when not being 
sensible about the selection of Mementos, the resulting page 
may never actually have existed. A second challenge relates 
to multi-dimensional conneg that involves the datetime 
dimension. Current conneg algorithms¹⁵ deal with variant 
selection in the dimensions specified in RFC 2295. These 
would need to be revised to include the datetime dimension: 
if a client requests an HTML Memento for a specific datetime, 
but only a PDF is available, what should the archival 
server do? Research is required to explore both problems. 
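
The two selection strategies discussed above can be contrasted in a few lines. This is a sketch under the assumption that the available Mementos are given as (archival datetime, Memento URI) pairs; the function names are hypothetical and no archive's actual selection code is implied.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Entry = Tuple[datetime, str]  # (archival datetime, Memento URI)


def closest_memento(requested: datetime, mementos: List[Entry]) -> Optional[str]:
    """Pick the Memento nearest in time, whether before or after the request."""
    if not mementos:
        return None
    return min(mementos, key=lambda m: abs(m[0] - requested))[1]


def active_memento(requested: datetime, mementos: List[Entry]) -> Optional[str]:
    """Pick the Memento that was current at the requested datetime.

    Only meaningful when every state of the resource was recorded, as in a
    transactional archive or a CMS with complete version history.
    """
    earlier = [m for m in mementos if m[0] <= requested]
    if not earlier:
        return None
    return max(earlier, key=lambda m: m[0])[1]
```

With Mementos from October 1 and October 20, 2009 and a requested datetime of October 12, 2009, the first function returns the October 20 copy and the second the October 1 copy. Applying the "closest" rule independently to a page and to each of its embedded resources is what can yield a composite page that never actually existed.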


6. RELATED WORK 


The goal of adding a temporal aspect to web navigation 
has been explored in projects that focus on user interface 
enhancement. The Zoetrope project [1] provides a rich in- 
terface for querying and interacting with a set of archived 
versions of selected seed pages. The interface leverages a lo- 
cal archive that is assembled by frequently polling those seed 
pages. The Past Web Browser [16] provides a simpler level of 
interaction with changing pages, but it is restricted to nav- 
igating existing web archives such as the Internet Archive. 
And DiffIE is a plug-in for Internet Explorer that emphasizes 
web content that changed since a user’s previous visit 
by leveraging a dedicated client cache [32]. None of these 
projects propose protocol enhancements but rather use ad- 
hoc techniques to achieve their goals. All could benefit from 
DT-conneg as a standard mechanism for accessing prior 
representations of resources. 

Some projects have dealt with the problem of disappeared 
web pages and finding archived or replacement copies on 
the Web. The use of lexical signatures as search engine 
query terms was proposed as a way to find content that had 
moved from its original URI [28, 29]. This approach was 
later applied to search for content in web archives [13, 17]. 
Also, when an “HTTP 404 Not Found” occurs, the ErrorZilla 
Firefox plug-in¹⁶ presents a user with a search page allowing 
her to find disappeared pages in web archives, and the UK 
National Archives' server plug-in redirects the client to an 
archive of its choice. As suggested (Section 5), in the DT- 
conneg framework a client could intelligently react to 404s, 
and when doing so leverage available re-finding approaches. 

To the best of our knowledge, very little research has ex- 
plored a protocol-based solution to augment the Web with 
time travel capabilities. TTApache [8] introduced a modified 
version of Apache that stored archived representations 
in a local transactional archive (similar to the configura- 
tion illustrated in Figure 1). Ad-hoc RPC-style mechanisms 
were used to access archived representations given the URI 
of their original, e.g. “page.html?02-Nov-2009” and 
“page.html?now”. This approach reveals the local scope 
of the problem addressed by TTApache, as opposed to the 
global perspective taken by the proposed DT-conneg framework. 
Indeed, the query components are issued against a 
specific server, and are not maintained when a client moves 
to another server as is the case with the X-Accept-Datetime 
header of DT-conneg. TTApache also allowed addressing 
archived representations using version numbers in query 
components rather than datetimes. This capability is sim- 
ilar to the deprecated “Content-Version” header field from 
RFC 2068 [10] and other, similar expired proposals (e.g., 
[27]). Such versioning features have not found widespread 
adoption, presumably because their address space is tied to 
a specific resource or server, and not universal like the date- 
time of DT-conneg. 


7. CONCLUSIONS 


In Web Archiving [22], Julien Masanès expresses a vision 
of a global grid of web archives realized by interconnecting 
existing and future ones: 


Such a grid should link Web archives so that they 
together form one global navigation space like 
the live Web itself. This is only possible if they 
are structured in a way close enough to the orig- 
inal Web and if they are openly accessible. 


We could not agree more, and feel that our Memento solu- 
tion presents a significant step towards achieving this vision. 
But our approach reaches beyond it. Indeed, the navigation 
space that results from our proposal is not "like the live Web 
itself", it is the Web itself, as regular navigation and time 
travel are integrated. Also, it does not restrict the global 
archival grid to web archives but incorporates servers (such 
as CMS) on the live Web that host archival content. The 
Memento solution is capable of realizing this, and does not 
disrupt firmly established HTTP practice. Rather, it adds 
to it an orthogonal time dimension. Moreover, the Memento 
solution does not disrupt existing web archives or their es- 
tablished operating principles, but leverages both by tightly 
integrating them into the web. Time travel can be ours. 